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Abstract 

X-ray crystallography is the predominant nnethod for obtaining atomic-scale information about biological macromolecules. 
Despite the success of the technique, obtaining well diffracting crystals still critically limits going from protein to structure. 
In practice, the crystallization process proceeds through knowledge-informed empiricism. Better physico-chemical 
understanding remains elusive because of the large number of variables involved, hence little guidance is available to 
systematically identify solution conditions that promote crystallization. To help determine relationships between 
macromolecular properties and their crystallization propensity, we have trained statistical models on samples for 182 
proteins supplied by the Northeast Structural Genomics consortium. Gaussian processes, which capture trends beyond the 
reach of linear statistical models, distinguish between two main physico-chemical mechanisms driving crystallization. One is 
characterized by low levels of side chain entropy and has been extensively reported in the literature. The other identifies 
specific electrostatic interactions not previously described in the crystallization context. Because evidence for two distinct 
mechanisms can be gleaned both from crystal contacts and from solution conditions leading to successful crystallization, 
the model offers future avenues for optimizing crystallization screens based on partial structural information. The availability 
of crystallization data coupled with structural outcomes analyzed through state-of-the-art statistical models may thus guide 
macromolecular crystallization toward a more rational basis. 



Citation: Fusco D, Barnum TJ, Bruno AE, Luft JR, Snell EH, et al. (201 4) Statistical Analysis of Crystallization Database Links Protein Physico-Chemical Features with 
Crystallization Mechanisms. PLoS ONE 9(7): el01123. doi:10.1371/journal.pone.0101123 

Editor: Bostjan Kobe, University of Queensland, Australia 

Received December 19, 2013; Accepted June 3, 2014; Published July 2, 2014 

Copyright: © 2014 Fusco et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits 
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Funding: Sample preparation and data acquisition were supported in part by the Protein Structure Initiative of the National Institutes of Health, NIGMS grant 
U54 GM094597. Crystallization screening results and meta data was supplied from the University of Buffalo Center for Computational Resources (CCR) through 
research supported by NIH R01GM088396 (EHS, JRL, and AEB). DF and PC acknowledge support from National Science Foundation Grant No. NSF DMR-1 055586. 
TJB acknowledges REU support from National Science Foundation Grant NSF CHE-1 062607. The funders had no role in study design, data collection and analysis, 
decision to publish, or preparation of the manuscript. 

Competing interests: The authors have declared that no competing interests exist. 
* Email: patrick.charbonneau@duke.edu 



introduction 

X-ray crystallography is the most frequently used technique to 
obtain structural information about biological macromolecules, 
currently accounting for more tlian 88% of the entries in the 
Protein Data Bank (PDB) [1]. However, as its name suggests, tlie 
method fundamentally relies on obtaining well diffracting crystals 
of the macromolecules or complexes of interest (generally termed 
proteins in the crystallographic context and used as such 
tiiroughout this paper). Some quantitative data on the success 
rate of crystallization comes from the Protein Structure Initiative 
(PSI). This program, initiated by the National Institutes of Health, 
has enabled high-throughput structural studies of biomolecules 
that track the experimental outcome, success or failure. Analysis of 
this data reveals that despite the large scale efforts, fewer than 30% 
of the proteins that are expressed and purified yield diffraction 



quality crystals, and of these only 67% provide structures (20% of 
the expressed and purified samples) [2,3]. There are currendy 
about 100,000 structures in the PDB [1], but more than 10 million 
non-redundant protein chain sequences have been reported [4]. 
The large number of proteins for which detailed structural 
knowledge remains unavailable is an ongoing challenge for high- 
throughput crystallization. 

The current approach to crystallization is empirical. Proteins 
are screened against arrays of many chemical conditions that are 
biologically "friendly" and have yielded crystals in the past [5]. As 
an example, the High-Throughput Crystallization Screening 
(HTS) laboratory at the Hauptman Woodward Medical Research 
Institute uses 1,536 different chemical conditions each aiming to 
reduce protein solubility so as to obtain ordered crystallization [6] . 
The approach yields at least one crystal in about 50% of the 
samples [7], but, from tracking the PSI supplied samples, only 
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about half of those initial crystal hits go on to yield subsequent 
structural information. This success rate is respectable in the field, 
although it should be noted that the result is not a per cocktail 
(crystallization experiment) statistic, but a binary analysis on the 
presence of a crystal within 1,536 different experiments. Out of all 
the screening performed in the HTS laboratory it is estimated that 
only 0.2% of individual crystaUization screening conditions yield a 
crystal; failure is unfortunately all too common [8] . One may thus 
hope that an improved physico-chemical understanding of prot(-in 
crystallization could help navigate the chemical screening space 
more efficiendy [9]. 

The positive and negative outcome data captured by the PSI 
and similar structural genomics (SG) efforts have been employed 
to determine the key factors that affect protein crystallization. A 
number of studies have used amino acid sequence features as 
inputs to machine-learning classification schemes, in order to 
identify proteins that should easily crystallize and thus be good SG 
targets [10-14]. This strategy faces two main diSiculties. First, it 
typically reUes on a protein's amino acid sequence, which is only 
indirectly related to crystal assembly. Surface residues are more 
directly linked to the crystallization process [15], but are 
challenging to determine with high fidelity de novo. Second, topical 
machinc-k'arning methods based on Support Vector Machine, 
which divide the feature space between different classes of 
macromolecules [16], are deterministic and can be hard to 
interpret physically. The complexity of the function that separates 
positive and negative regions of parameter space typically hinders 
the physico-chemical interpretation of the results and thus the 
transfer of microscopic insights to applications beyond crystal 
formation, such as peptide design [9,12,17,18]. 

Two statistical inference models trained on a richer set of 
protein features have gone beyond these difEculties [19,20]. Both 
of them find that low values of surface side chain entropy (related 
to the degrees of freedom of the surface residues) and a high 
fraction of small surface residues, such as glycine and alanine, 
assist crystallization. They thus support surface entropy reduction 
(SER) mutagenesis, which broadly prescribes replacing large 
residues, e.g. lysines and glutamic acids, with alanines [21,22], in 
order to facilitate crystallization. Yet the two modeling approaches 
use somewhat orthogonal algorithms for predicting crystallization 
propensity and their results do not always agree [12]. 

Part of this discrepancy may come from the linearity of the 
underlying models. Linear models have the advantage of being 
easily interpretable, but they struggle to capture subde non-linear 
and possibly non-monotonic trends, which can make them 
sensitive to the details of the training set. However, protein 
crystaUization responds non-linearly to changes in solution 
conditions. Extremely low values of side chain entropy indeed 
hinder crystallization by compromising protein solubility, as 
observed in experiments [23] and as predicted in solubility models 
[24]. George and Wilson also carefully documented the non- 
linearity of protein ('r\stallization by identifying the range of 
second virial coefficient (not too high, not too low) over which 
proteins typically crystallize [25], a result that is fundamental to 
the materials physics understanding of protein assembly [26-30]. 

In this work, we have used a subset of the screening results from 
the North East Structural Genomics consortium (nesg) to train 
models based on Gaussian processes (GP). GP replace specific 
constraints on the functional form of the model with a prior 
distributions that weighs all of the (infinite) smooth functions [31], 
and can thus better capture the non-linear and non-monotonic 
relations in a dataset. The resulting models help us address two 
fundamental questions about protein crystallization, (i) What 
protein properties determine crystallization propensity in standard 



screens? (ii) How do these properties relate to successful 
crystaUization conditions? Answering (i) enhances existing muta- 
genesis prescriptions to facflitate the crystaUization of recalcitrant 
proteins without denaturing their structure; answering (ii) suggests 
guidelines for tailoring and narrowing the set of solution concUtions 
for crystaUizing a given protein. 

Results and Discussion 

The dataset provided by the nesg contains information about 
182 distinct proteins that were supplied in a common buffer and 
set up against an array of 1,536 different chemical cocktaUs 
representing an extensive set of crystallization conditions (see 
Methods). The different microbatch under-oU experiments were 
imaged over time and each outcome was visually classified as 
containing a crystal or not. Protein structures were subsequentiy 
determined using X-ray crystallography by the nesg. In this 
dataset, a broad range of crystallization propensity, ^, defined as 
the fraction of the 1,536 cocktails that successfuUy generated 
crystals, is observed. Two proteins formed crystals in as many as 
30% of the tested conditions, hut most did so in only a few of the 
solutions (Fig. 1). The biiiar)- classification between crystal or no 
crystEil does not distinguish between the stochastic nucleation 
process and the crystal growth process, once nucleation has 
occurred. Both must have happened to produce a crystal. WhUe 
we may have false negatives that could be redu< cd In- replication 
of the crystaUization screening process, the large- range of related 
chemical conditions and the fairly large number of samples studied 
should largely mitigate this effect. Some proteins may nucleate 
more easily than others, but once nucleation occurs, growth 
follows and the dataset records this outcome. Nucleation could 
perhaps be deconvoluted from crystaUization by recording the 
number of crystals produced per chemical condition, but it has not 
been attempted for this study. The term crystallization in our case 
therefore necessarily indicates both crystal nucleation and crystal 
growth. 

GPR: Crystallization propensity 

We identified some of the factors behind facile protein 
crystaUization by training a Gaussian process regression (GPR) 
model for £, using a set of physico-chemical properties as predictive 
variables x. (For mathematical convenience the output function 
fix) of the GPR model is chosen to be / = t^/(l— (instead of 
f = £), but the uniqueness of this transformation and of its inverse, 
<^=//(l +/), results in no loss of generality. See also Methods for 
more detaUs.) The flexibUity of GP enables GPR models to capture 
any continuous relationship, no matter how complex, between x 
and the output function /. In order to avoid overfitting the model 
parameters, we optimized the model marginal log-likelihood (Eq. 
(8)), which rewards good fitting of the data whUe penalizing overly 
complex models. The training process selects one specific function 
that best captures the effect of the predictive variables x. Because 
many local maxima of the marginal log-likelihood can be found 
over the parameter space, a broad search is necessary. An 
exhaustive sampling is out of computational reach, but the largest 
maximum we located is also the best performer in leave-one-out 
(LOO) cross-validation (Fig. 2 A and B). This standard diagnostic 
tool for overfitting [3 1] indicates that the choice of parameters is 
reasonably representative of the best model. A direct comparison 
reveals that the resulting GPR model recovers the observed data 
more precisely than linear regression (LR) in 74% of the NESG 
proteins, with GPR performing consistendy better for proteins 
with a moderate-to-high crystallization propensity (Fig. 3). We thus 
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Figure 1. Summary of some of the features of the proteins contained in the nesg dataset. Histograms report the pi, average hydropathy 
index (GRAVY), average side chain entropy (SCE), average polarity (POL). Both the overall (blue) and the surface (cyan) valus are shown for the GRAVY 
index, side chain entropy (SCE) and polarity (POL). Distribution of the surface coverage for each amino acid and histogram of the crystallization 
propensity c are also shown. 
doi:1 0.1 371 /journal.pone.OI 01 1 23.g001 



confirm that a non-linear model better relates a protein's 
crystallization propensity to its physico-chemical properties. 

The significance of specific predictive variables in a GPR model 
can be assessed by the magnitude of their corresponding learned 
length scale / in the GP kernel function (see Methods). A small / 
indicates that the model has a high sensitivity to a specific 
property, and vice versa. In that sense / plays a role similar to a 
weight in a LR model, but is unsigned. Determining whether a 
variable is positively or negatively correlated with the output of the 
model requires a local analysis of its predictions. 

Comparing / for the different protein surface residues indicates 
that the most significant residues are the aromatics (phenylalanine 
(F), tyrosine (Y), tryptophan (W), and proline (P)) as well as cysteine 
(C) and glutamic acid (E) (Fig. 4). The importance of phenylal- 
anine and glutamic acid was uncovered in earlier studies [19], but 
that of the other aromatic residues and of cysteine had previously 
gone undetected. The contrast between the (small) / associated 



with aromatic residues, which are hydrophobic and large, and the 
(large) / associated with small hydrophobic residues, i.e., leucine 
(L), isoleucine (I), and valine (V), further indicates that only large 
hydrophobic residues play a significant role, which is not all 
together surprising based on hydrophobicity arguments [32]. The 
case of cysteine is interesting. A recent protein crystallization 
engineering study found that replacing some residues with 
cysteines promotes crystal formation because of the residue's 
ability to form disulfide bonds and hence dimerize [33]. Yet that 
very reactivity can also result in noncrystalline aggregation [34]. 
The non-monotonicity resulting from the two competing behav- 
iors, which the GPR model here detects, may explain why earlier 
LR-based studies had not detected its importance [19]. 

At a coarser level, three surface residue categories - small, 
positively charged, and polar - are found to be significant (Fig. 4). 
Small residues enable the formation of favorable inter-protein 
backbone contacts and their low side chain entropy eases the 




model /obs 



Figure 2. A: Marginal log likelihood for 100 models that each are distinct local maxima of Eq. (8). The model with the highest value is used for the 

(J 

rest of the analysis. B: Scatter plot of the observed output function /obs = i — ^ and its value predicted by the GPR model /qpr using a LOO cross 

1 — c 

validation with 95% confidence intervals. The inset details the low propensity data. 
doi:1 0.1 371 /journal.pone.OI 01 1 23.g002 
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Figure 3. Difference between the observed and tKie modeled propensity values/. A: Points in the upper half of the graph correspond to 
the cases in which LR performs better than GPR, and vice versa for the \ower half B: The histogram summarizes the overall performance of LR and 
GPR. 
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formation of cry.stal contacts [19,20]. The general importance of 
surface side chain entropy (sSCE) further supports this interpre- 
tation (see Method for sSCE definition). The role of the other two 
residue categories is more controversial. Cieslik and Derewenda 
found that polarity strongly affects whether a residue belongs to a 
crystal contact [20], but Price et al. did not detect any significant 
contribution from individual polar residues [19]. Charged residues 
also have an ambiguous role. Lysine is thought to inhibit 
crystallization [19], while arginine has been suggested to facilitate 
crystallization in isolated instances [30,35,36]. Yet no correlation 
between arginine and crystallization propensity had thus far been 
noted. 

The asymmetry between positively and negatively charged 
residues in affecting crystallization propensity is puzzling, espe- 



cially because positive and negative residues are almost identically 
distributed over the surface of the proteins studied. One possible 
explanation is that the effect mirrors the asymmetry (slightly) 
favoring pH < 7 in the 1 ,536 condition screens (see Methods). This 
imbalance may neutralize the net charge of negative residues and 
therefore their ability to electrostatically affect inter-protein 
interactions. Another possible explanation comes from the 
asymmetry in water's charge distribution, strengthening interac- 
tions between water molecules and negative residues, and 
therefore favoring residue solvation [37]. The increased partici- 
pation of negatively-charged glutamic and aspartic acids com- 
pared to positively-charged residues in protein-protein interactions 
bridged by water supports this second scenario [38], but overall 
the evidence remains inconclusive. 



nesidue 



10' 



- 10' 



rTTTTMM 




GRAVY 



sGRAVY sSCE 
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Figure 4. Length scales / associated with the parameters for the maximal log-likelihood model (highest point in Fig. 2 A). The plots 
divide the explanatory variables in (A) residues, (B) residue type (small S, positively charged +, negatively charged — , polar P and hydrophobic H), (C) 
neighboring residue type (pairs of these symbols), and (D) other global protein properties (see Methods for details). 
doi:1 0.1 371/journal.pone.01 01 1 23.g004 
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The role of neighboring pairs of surface residues, although 
presumed to be significant [39], had not been previously directly 
assessed. We find that small-small (SS), small-polar (SP), and small- 
hydrophobic (SH) pairs as well as negative-positive (+ — ) and 
negative-polar ( — P) parrs markedly affect crystallization propen- 
sity (Fig. 4). The significance of the first three pair types reflects the 
enhanced role of small residues when coupled with specific residue 
types in promoting backbone-backbone (small-small), side chain- 
backbone (smaU-polar) and hydrophobic (small-hydrophobic) 
interactions. The importance of neighboring negative and positive 
residues is particularly interesting. Intra-chain pairing of these two 
residues can indeed suppress a potential source of favorable 
electrostatic inter-chain interactions. This mechanism has even 
been observed to hinder crystal formation in computational studies 
[30], pointing to a potential target for mutagenesis. 

Similarly to the case of individual charged residues, which 
shows an asymmetry between positive and negative side chains, 
negative-polar neighboring residues appear important, whereas 
positive-polar pairs are found to play no significant role. Proteins 
with a high crystallization propensity are significantly depleted in 
both of these pairings (see below and Table 1), suggesting that both 
hinder crystallization. We note, however, that retraining the GPR 
model without the positive residue category gives as much 
importance to positive-polar pairs as to negative-polar pairs. 
Similarly, retraining the model without the negative-polar 
category gives added importance to the negative residue category. 
These observations imply that a strong correlation exists between 
positive residues and positive-polar pairs as well as between 
negative residues and negative-polar pairs. Yet the correlation is 
incomplete. The importance of positive residues is indeed better 
captured when they are considered alone (10^ times more likely 
than the alternate model), and that of negative residues is better 
captured when they are coupled with polar residues (10^ times 
more likely than the alternate model). The source of this 
correlation and asymmetry remains unclear, and to the best of 
our knowledge, no physico-chemical explanation for this observa- 
tion has thus far been suggested. 

None of the other protein properties, including their isoelectric 
point (pi), significantiy affect crystallization, which is in line with a 
recent LR analysis of a similar dataset [19]. Although this finding 
may seem surprising based on earher reports that found the pi to 
be an important physical factor in protein crystallization [5,40,41], 
the discrepancy likely results from the selection bias of standard 
screens (like those analyzed here) in favor of conditions that are 
expected to reduce protein solubility for most proteins. These 
screens avoid pairing low-salt and extreme-pH conditions, which 
result in unscreened similarly charged proteins. Were these 
conditions present, they would likely statistically emphasize the 
physical importance of pi in protein crystallization. 



Table 1. Kolmogorov-Smirnov test between high and low 
crystallization propensity proteins in different sSCE-sGRAVY 
quadrants. 









Q1 


Q3 


Q4 


enriched 


P,PP,PH 


H,HH 




depleted 


-,S-,-H,-- 


S,SP,+P 


-P 



For the different quadrants, the list of properties for which easy to crystallize 
proteins (C>i^75%) are enriched for (or depleted in} compared to hard to 
crystallize proteins (t^<^25%). Symbols as in Fig. 4. 
doi:l 0.1 371/journal.pone.01 01 1 23.t001 



It is important to note that the GPR model is able to capture all 
the significant trends spotted by previous LR models [19]. In 
particular, it identifies the role of alanine and glycine (as small 
residues), of phenylalanine, and of sSCE in promoting crystalli- 
zation. The other variables that were identified as important in the 
LR model of Ref [19] but are not singled out by the GPR model, 
such as lysine and sGRAVY, were actually found to be redundant 
because of their strong correlation to sSCE [19]. This result 
highlights the elegance with which GPR handles correlations 
among the explanatory variables. 

In this respect, one correlation that is inherent to our choice of 
variables is that between the identity of specific residues and the 
residue category to which they belong. Because we find that 
residue categories impact the crystallization propensity more 
significantiy than most individual residues, we trained a second 
GPR model using only surface residue categories and pairs of 
surface residue categories as descriptive variables. This reduced 
GPR model performs very similarly to the complete version, and is 
also much better than LR (in 72% of the cases. Fig. 5). This 
analysis suggests that a coarsened description employing only 
residue categories could serve as a first approximation to 
understanding and tuning protein crystaUization using mutagen- 
esis. 

GPR: Independent crystallization mechanisms 

The complete GPR model also reveals the presence of 
crystallization hot spots, i.e., regions of x that give a high In 
order to locate these hot spots, we select the proteins from the nesg 
sample in the top 5 percentile for crystallization propensity 
((;>^95% = 0.I) as starting points for searching the protein 
property space. We specifically explore how ^ changes when 
moving away from these starting points by changing the surface 
residue composition (see Method section for search details). 
Figure 6 presents the hot spots projected on the sSCE and surface 
hydropathy (sGRAVY) plane. These two variables are thought to 
strongly influence the ease with which a protein crystallizes 
[20,38,42]. We divide the plane in quadrants using the nesg 
dataset's average sSCE and sGRAVY as delimiters. Interestingly, 
the model predicts high ^ regions in all four quadrants, but some 
of these regions are not biologically reasonable for actual proteins 
and should therefore be discarded. The crystaUization propensity ^ 
is indeed a mathematical object with no physical constraints, and 
hence can be defined for any combination of properties x. 
Propensity maxima that are "unbiological", such as proteins 
whose surface is constituted of a single amino acid, can therefore 
be found. To obtain an estimate of the property range 
corresponding to actual proteins, we locate on the sGRAVY- 
sSCE plane a set of 1,619 distinct monomeric proteins, i.e., 
proteins whose crystal contacts are not biologically-driven, 
reported in the PDB that (as of October 2013) were high- 
resolution (<2 A) and had less than 90% sequence similarity. As 
shown in Figure 6, these proteins cover only a small region of the 
sGRAVY-sSCE plane, locating only two biologically-relevant 
high-propensity regions. First, a high propensity region spans Q3 
and Q4, corresponding to proteins with lower than average sSCE 
and moderate hydrophobicity (green circle). Second, a series of hot 
spots are detected in Q^l (fuchsia circle). The landscape roughness 
of this second region, however, suggests that sSCE and sGRAVY 
alone do not fully characterize its properties. Additional structural 
variables need to be considered. 

In order to get a clearer physico-chemical understanding of the 
high crystallization propensity regions, we use the GPR model to 
predict C for each of the 1,619 PDB proteins described above, 
sorted according to the sGRAVY-sSCE quadrant to which they 
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Figure 5. A: Scatter plot of the regression residuals as a function of the observed output function /obs for the complete GPR model (blue circles), the 
reduced GPR model with only residue categories as variables (green triangles), and the standard LR model (red squares). B: Histogram of the residuals 
in A that quantifies the quality of the GPR fits. 
doi:1 0.1 371 /journal.pone.01 01 1 23.g005 



belong (46% in Q,l, 5% in Q2, 19% in Q3, and 30% in Q_4). This 
richer dataset provides a stronger signal than the nesg database 
alone, but relies on the assumption that these 1,619 proteins do not 
deviate too strongly from the nesg protein set. In support of this 
assumption, we note that the distribution of protein properties of 
the two sets do not significantly differ. From the quadrant analysis, 
we obtain the distribution of surface residues for proteins that are 

predicted to be easy (^> {75% = 0.1) and hard (€<f25% = 0-01) to 
crystaUize within different quadrants, save for Q2, which is too 
sparsely populated. (Thresholds with a tUde refer to the 
distribution of modeled crystallization propensity for the set of 
1,619 PDB proteins and not to results of the nesg dataset.) 




sGRAVY 

Figure 6. Projection of the average predicted propensity c over 
the sGRAVY-sSCE plane. Blue lines divide the plane in four 
quadrants (Q1, Q2, Q3, and Q4) based on the average sGRAVY and 
sSCE of the proteins in the nesg database. Cyan pluses represent the 
projection of the 1,612 structures downloaded from the PDB, whose 
distribution is used to generate Tables 1 and 2. The green circle broadly 
indicates the region where the sSCE is the main driving force to 
crystallization, while the fuchsia circle indicates the region where 
specific chemical complementarity plays a more significant role. The 
model also captures the reduction in propensity associated with limited 
protein solubility at very low sSCE. 
doi:1 0.1 371/journal.pone.01 01 1 23.g006 



Koknogorov-Smirnov tests determine whether the distributions of 
protein properties are significantly different (p-value<0.01) 
between easy and hard to crystaUize proteins within a given 
quadrant (Table 1). As expected, in Q_3 hydrophobicity emerges as 
the major drive to crystallization. High propensity proteins are also 
depleted in small residues compared to their recalcitrant 
counterparts in the same quadrant, as found above. Because 
these proteins are very hydrophobic, low sSCE results in proteins 
that are insoluble. Easy to crystaUize proteins in Q^l are enriched 
for polar residues and for pairs of side chains that involve polar 
residues. These proteins thus likely rely on electrostatic interac- 
tions to form some of their crystal contacts. Surprisingly, they are 
also depleted in negative residues, which, as discussed above, may 
be an artifact of the choice of solution conditions. 

Our analysis suggests that two distinct physico-chemical 
mechanisms drive crystallization, depending roughly on whether 
a protein has lower or higher than average sSCE. The first 
mechanism is based on limited interference from side chain 
entropy for hydrophobic interactions, and the second relies on the 
formation of specific, complementary charge and polar interac- 
tions. To test for the presence of these distinct crystaUization 
mechanisms in an independent way, we analyzed protein crystal 
contacts (see definition in Methods). These contacts carry the 
structural signature of the interactions that drive crystal formation. 
A Kolmogorov-Smirnov test revealed marked differences in 
residue and residue pair distributions for proteins belonging to 
dUferent quadrants, especiaUy between the low-sSCE proteins in 
0,1 and the high-sSCE proteins in Q3 and Q4 (Table 2). Q.1 
proteins are enriched for charged and polar residues, while Q3 
and Q4 proteins are enriched for smaU and hydrophobic residues. 
Q3 and Q4 can be further distinguished from each other by the 
higher frequency of hydrophobic (in Q3) vs polar (in Q4) residues 
present in crystal contacts. Given these results we confirm that Q,l 
proteins crystallize mostly using complementary electrostatic 
interactions (enriched for positive-negative pair residues), Q3 
using backbone-backbone (enriched for smaU-small pair residues) 
and hydrophobic interactions, and Q4 using both backbone- 
backbone and polar interactions. The fact that such a distinction 
can be made is particularly remarkable because different crystal 
contacts of a protein may, in general, involve more than one type 
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Table 2. Kolmogorov-Smirnov test for crystal contact distributions. 
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+, - ,++,+ - ,+P,+H, - - , - P, - H 


+,-,-H-,+ -,+P,+H,- - 


-,-P,-H 


Q3 


S,H,SS,S+,S-,SP,SH,PH,HH 




H.SS.HH 




Q4 


S,SS,S+,S-,SP,SH,PP 


+,-,P,++,+ -,+P,-P,PP 







Enrichment for specific properties of the crystal contacts of the proteins belonging to different sSCE-sGRAVY quadrants. For example, position (Q1,Q3) lists the 
properties for which proteins in Q1 are enriched compared to proteins in Q3. Note that the pairs here indicate interactions between residues on different chains rather 
than neighboring residues on the same chain. Symbols as in Fig. 4. 
doi:10.1371/journal.pone.0101123.t002 



of interactions [30] , which, through averaging, should weaken tlie 
statistical signature of this effect. 

From the analysis above, we note (i) that reducing sSCE is not 
the only pathway to generate less recalcitrant mutant, and (ii) that 
proteins with low sSCE are not necessarily easier to crystallize. 
Our findings suggest that, depending on the level of sSCE, 
proteins crystallize using two different sets of physico-chemical 
mechanisms. At high sSCE, crystallization relies mostiy on the 
enthalpic gain of forming favorable electrostatic interactions, such 
as salt-bridges or polar interactions; at low sSCE, the reduced 
entropic cost of freezing small residues as well as the hydrophobic 
effect appear to be the driving forces. For the latter group of 
proteins, if reducing solubility is necessary to form protein crystals, 
the mutagenic strategy proposed by SER is more likely to be 
successful. For proteins with higher sSCE, however, too many 
mutations may be necessary to reach a range of sSCE that 
promotes crystallization. Mutating a few selected residues that can 
trigger electrostatic interactions may then be a more effective 
strategy. For example, replacing small residues with polar residues 
or mutating charged side chains that are found next to oppositely 
charged side chains could help promote inter-protein electrostatic 
interactions. 

GPC: Solution conditions for different crystallization 
mechanisms 

In order to test whether the two main crystallization mecha- 
nisms identified above are optimized by distinct sets of solution 
conditions, we trained four separate GPC models on the nesg 
dataset. Separately for Q,l and Q_3, we considered proteins with 
higher ((^> 1^50% =0.04) and lower ((^<(^5o%) than average 
propensity. We were particularly interested in the solution 
conditions that help crystallize recalcitrant proteins from the Q,l 
(electrostatic mechanism) and the Q3 (SER/hydrophobic mech- 
anism) quadrants. The relative abundance of statistically signifi- 
cant solution properties in the Q_l model for recalcitrant proteins 
indicates that the response of Ql proteins to changes in solution 
conditions is more complex than that of Q3 proteins. This 
observation is consistent with chemical interactions in Q^l being 
more heterogeneous and thus responding to specific solution 
conditions. In this context, the ionic strength and the presence of 
high-valency ions (with charge +2 or +3) seem to play 
particularly important roles (Table 3). The capacity of certain 
high-valency ions to coordinate proteins at crystal contacts [43] or 
to bind proteins active sites may pardy explain this sensitivity. 

Even more compelling evidence for the heightened importance 
of the crystallization conditions in Q^l compared to Q3 is the 
crystallization response to changes in solute concentration (Fig. 7). 
Trained GPC models allowed us to extend the results reported in 
the NESG database and to explore how the probability of successful 
crystaUization, n, is affected by the solution features. Although 



predictions for solution conditions that are very different from 
those experimentally tested have a high uncertainty and are 
essentially meaningless, for conditions similar to those used in the 
training set, the model may actually enrich that information. We 
specifically consider the predictions based on the molarity c, of a 
given additive / (see Methods). As a first approximation, we fit each 
of these trends to a second-order polynomial 
n{Cj) = aQ-\-aiCj + a2cj. By definition, Oq is the probability that 
the protein crystallizes without additives, while ai and 02 qualify 
the crystallization probability dependence on the additive 
concentration. The behavior of the model can be visualized as a 
scatter plot where each symbol represents the response of a given 
protein to a given salt (Fig. 7). (The lower-left quadrant is empty 
because monotonically decreasing trends are excluded from this 
analysis.) The upper-left quadrant corresponds to proteins whose 
probability to crystallize is highest at intermediate concentration, 
and the lower-right quadrant to those that crystallize more easily 
under either very high or very low additive concentrations (insets 
in Fig. 7). (Recall that the underlying data is based on the visual 
observation of crystals in samples where one or more crystaUiza- 
tion conditions out of typically many has proceeded to provide 
structural data. Not every crystal in every experiment is examined 
by X-ray (or UV imaging) and it is thus possible that crystals 
forming in high salt conditions are actually salt crystals.) Additives 
to Ql proteins span the upper-left quadrant fairly broadly, which 
suggests a high heterogeneity in optimal additive concentrations. 
Most additives to Q3 proteins tend to have small values of 02 and 
positive ai, corresponding to a nearly linear response. Interest- 
ingly, many more salt types result in non-decreasing trends for Q^l 
than for Q3 proteins, i.e., there are more blue than red symbols. 
Tuning salt type and concentration is thus likely to be a more 
effective strategy to crystallize proteins in Q^l (electrostatic 
mechanism) than those in Q3 (SCE/hydrophobic mechanism). 
In other words, for Q^l proteins tuning crystallization conditions 
should be sufficient to obtain a crystal, whereas a target protein 
belonging to Q3 may need more invasive mutagenic approaches if 
it does not crystallize from standard screens. Reciprocally, if a Q3 
protein crystallizes in a standard screen, it is likely to produce hits 
in several conditions. Interestingly, the two highly-crystallizable 
NESG proteins, which crystallized in more than 30% of conditions 
(PDB ids: 20YR and 2PGX, Fig. 1), belong to this category. 

The model's optimal salt and PEG concentration for each hard 
to crystallize protein in Q^l and Q3 varies by quadrant (Fig. 8). For 
Q3 proteins, adding salt rarely improves the probability to 
crystallize (4 out of 42 proteins) and, when it does, high salt 
concentrations are preferred. By contrast, the crystallization 
probability of more than 40% of Q_l proteins is improved by the 
presence of salt (at concentrations between 0. 1 and 1 M). Once 
again, these predictions support an electrostatics-dominated 
mechanism for Q_l. Most of the optimal conditions for these 
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Table 3. Significant experimental conditions for hard-to-crystallize proteins belonging to Ql and Q3. 



Q1, i<t5m 




Q3, ^<C5o% 




property 


/ 


property 


1 


IS 


1.00 


Cadmium 


1.00 


PEG 10000 


1.00 


Nickel 


1.00 


Citrate 


1.00 


Cesium 


1.00 


PEG 5000 


1.00 


Pyrophosphate 


1.09 


Strontium 


1.00 


Succinate 


1.11 


Nickel 


1.03 


Iodide 


1.20 


Tetraborate 


1.04 


Barium 


1.45 


PEG 200 


1.04 


Fluoride 


1.47 


Tartrate 


1.05 


Samarium 


1.73 


Triphosphate 


1.05 


Copper 


1.84 


Samarium 


1.06 


PEG 200 


1.84 


Copper 


1.12 


Tetraborate 


1.87 


Iron 


1.13 


Iron 


1.99 


Iodide 


1.19 


PEG 2000 


2.21 


Barium 


1.49 


Triphosphate 


2.34 


Cesium 


1.64 


PEG 5000 


2.64 


Pyrophosphate 


1.68 


Zinc 


3.13 


Cadmium 


1.72 


4-aminosalycilate 


4.00 


PEG 1500 


2.49 


Gadolinium 


6.38 


PEG 550 


2.56 


Cacodylate 


7.44 



Most significant condition properties for hard to crystallize proteins belonging to Ql and Q3. Properties are colored according to their classification: cation (bold), anion 
(italic), PEG and others (regular). 
doi:l 0.1 371 /journal.pone.Ol 01 1 23.t003 



proteins cluster by salt type (contiguous patterns along the 
horizontal axis). Because salt types are ordered by cation, the 
clustering of the results suggests that the crystallization of Q,l 
proteins is more sensitive to the type of cation than to the type of 
anion. PEG also results in distinct crystallization patterns for Q^l 
and Q3 proteins. The former prefer high concentrations of large 
PEG molecules, while the latter heterogeneously respond to the 
presence of PEG, both size- and concentration-wise. These results 
suggest that the successful crystallization of Q^l proteins requires a 
wide sampling of salt types (specifically cations) and concentra- 
tions. For these proteins, it may thus suffice to tune the 
crystallization conditions without resorting to mutagenesis. By 
contrast, tuning the type and concentration of PEG appears to be 
more effective for Q3 proteins, which are, however, generally less 
sensitive to solution conditions. Mutations, such as those suggested 
by SER, may then be necessary to promote crystallization. 

Conclusions 

Using state-of-the-art statistical techniques on a detailed 
database of protein crystallization experiments coupled with 
extensive information on those proteins and their resulting 
structures, our study recapitulates, with a single model, many 
physico-chemical factors that independent studies have related to 
crystaUization propensity, and detects the correlations between 
these variables. In addition, our model distinguishes two main 
mechanisms that drive monomeric protein crystal assembly. One 
is mainly entropic and exploits low side chain entropy and 
hydrophobicity; the other is energetic and relies on complemen- 
tary electrostatic interactions. The key contribution from electro- 



static interactions provides further evidence that crystal contacts 
have a specific physico-chemical signature even if they are not 




Figure 7. Scatter plot of polynomial parameters that charac- 
terize the non-monotonic trends in crystallization probability 
with concentration for different salts. Proteins belonging to Q1 are 
represented by blue pluses and those belonging to Q3 by red crosses. 
The insets sketch the probability trend as a function of salt 
concentration for different combinations of a\ and cij (fl2<0 and 
a\ >0 vs 02 >0 and a\ <0). 
doi:1 0.1 371/journal.pone.01 01 1 23.g007 
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salt type PEG type 



Figure 8. Optimal conditions for low crystallization propensity proteins in Q1 (upper side) and Q3 (lower side) for various salt types 
(molar concentration) and PEG types (% mass concentration). Proteins are ordered by sSCE; salt types are ordered by cation (ammonium, 
calcium, lithium, magnesium, manganese, potassium, rubidium, sodium, zinc, barium, cesium, cobalt, copper, iron, gadolinium, nickel, samarium, 
strontium, cadmium) and, within each cation, by anion (citrate, malonate, succinate, tartrate, acetate, bromide, cacodylate, carbonate, chloride, citrate 
tribasic, fluoride, formate, iodide, molybdate, nitrate, phosphate monobasic, phosphate dibasic, phosphate tribasic, pyrophosphate tetrabasic, sulfate, 
tetraborate, thiocyanate, thiosulfate, 4-aminosalicylate); and PEG types are ordered by molecular weight (200, 400, 550, 1000, 1500, 2000, 3350, 4000, 
5000, 6000, 8000, 1 0000, 20000 g/mol). 
doi:1 0.1 371 /journal.pone.01 01 1 23.g008 



biologically functional [30,42,44—47]. These interactions are 
indeed of the same nature as those that traditionally result in 
specific and thus biologically relevant interactions, such as protein 
complex assembly or protein-target recognition. The knowledge 
accrued over the years for these interactions [48] may thus be 
useful for understanding and designing crystal contacts [33,49]. 

The GP-based models developed in this study also estimate the 
crystallization propensity of any protein, given a set of its physico- 
chemical properties, and identify mutagenesis strategies that are 
more likely to yield protein crystals. For example, we find that it 
may be favorable to mutate positive-negative surface residue pairs 
to uncharged residues or small residues to polar ones, in order to 
crystallize a recalcitrant Q^l protein, whereas SER guidelines may 
be more useful for crystallizing Q3 proteins. In addition, using 
data from crystallization screens, an improved set of solution 
conditions can be determined given some of the protein surface 
properties. For example, fine-tuning salt concentration and cation 
type appears to be an effective strategy for proteins with higher 
than average sSCE. In contrast, using a high salt concentration 
and the addition of PEG appear to be more effective approaches 
for crystallizing proteins with lower than average sSCE. 

Although our analysis cannot be directly applied to de novo 
protein crystallization, a coarse Q,l/Q3 classification may still be 
possible based on a protein's average SCE, which linearly 
correlates to its sSCE and can be determined from the primary 
structure. This approximate assignment may narrow down which 
one of the two main crystallization approaches is more likely to be 
successful. More precise and complete structural information, e.g., 
residue types and pairings, could also be obtained by combining 
different (imperfect) protein folding algorithms [50]. For example, 
relatively precise estimates of sSCE can be calculated from 
available computational tools, such as PredictProtein [51]. It 
should thus be possible to compute from sequence information 
alone what residues are likely to be exposed and, consequently, to 
estimate the protein properties that the GPR and the GPC models 
need to predict its crystallization propensity and optimal 
crystallization conditions. Future studies will integrate the current 
models with algorithms that estimate these properties, and assess 
their experimental success. 

Finally, the accuracy of any statistical model depends on the 
quantity and quality of the training set. Our findings emphasize 
the need for an increased availability and standardization of 



protein crystallization datasets [8] . A richer characterization of the 
experimental outcomes would also extend the reliability of these 
models. For example, different successful crystallization conditions 
can yield distinct crystal forms and thus crystal contacts for the 
same protein. The availability of crystal symmetry and contact 
information for different conditions would refine our understand- 
ing of the correlation between experimental conditions and the 
(solution mediated) protein-protein interactions that drive crystal- 
lization. Similarly, unsuccessful conditions could be defined more 
finely depending on whether a protein remained soluble or gelled. 
Interpreting this data in light of phase diagrams would further 
clarify the physico-chemical basis for protein crystallization and 
guide future experiments. It is thus reasonable to anticipate that 
the extension of statistical models and the increased availability of 
training datasets will help guide biomolecular crystallization 
toward a more rational basis. 

Methods 

Data 

The crystallization database reports binary crystallization 
outcomes in 198 samples of 182 unique proteins from the nesg 
(list of PDB IDs in Materials SI) each in 1,536 solution conditions 
in microbatch under-oU experiments conducted at the Hauptman- 
Woodward Medical Research Institute High-Throughput Crys- 
tallization Screening (HTS) laboratory. The concentration of the 
various chemicals, proteins, and pH are reported. The solution 
conditions span six generations (generations 5 to 9) of the cocktails 
used in the HTS center with approximately half the conditions 
representing commercially available crystallization screens and the 
other half an incomplete factorial sampling of chemical space [6] . 
Most experimental conditions fall into two categories: moderate to 
high salt alone, and low salt with PEG representing typical 
crystallization strategies. Although a total of 311 different 
chemicals are used, we focused on the effect of ions (divided in 
19 cations and 24 anions) and 13 types of PEG for a total of 56 
analyzed chemical species. The pH distribution is slightly biased 
towards lower values (mean pH = 6.8). The chemical species 
concentrations are combined to obtain the ionic strength of the 
solution (IS), a Hofmeister series coefficient (HS,, for anions and 
HSc for cations) and a depletion effect coefiicient (DEP), 
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IS=^^c,.Z? (1) 

leions 



fecations 



HS,= J2 (3) 

/eanions 



DEP= E |k(^. + ^')'= E t|2(^. + ^/^V, (4) 

where c, is the species concentration, Zj the ion charge, Rp is the 
solvated protein radius of gyration, M,- is the PEG molecular mass 
[52-55], and hs,- is a Hofmeister index that ranks the species from 
more to less kosmotropic (cations: ammonium, cesium, rubidium, 
potassium, lithium, sodium, barium, magnesium, manganese, zinc, 
cadmium, calcium, cobalt, copper, nickel, strontium, iron, 
gadohnium, samarium; anions: triphosphate, tricitrate, sulfate, 
tartrate, carbonate, thiosulfate, diphosphate, succinate, citrate, 
acetate, malonate, fluoride, formate, chloride, bromide, iodide, 
monophosphate, thiocyanate) [56]. 

It is important to note that the data comes from samples that 
produced hits in crystallization screening and then went on to yield 
a structure deposited in the PDB. Crystal hits that yielded no 
structural data are beyond the scope of our analysis. For the 182 
proteins studied, 29% gave hits in ten or fewer of the 1,536 
different chemical cocktails but 18% gave hits in 100 or more 
cocktails. Typically only the best set of initial conditions go 
forward to optimization, hence we have no data on how well a 
crystal may have diffracted when grown in one of the other 
solutions. In this analysis we also give equal weight to each 
crystallization hit, which introduces additional noise in the data. It 
is also important to note that the protein samples were all prepared 
in a common buffer, which reduces the number of solution 
variables. 

PyMol was used to determine the structural characteristics of 
each protein from its PDB structure: the fraction of the protein 
surface carrying each residue (a residue was considered exposed if 
at least 2.5 of its surface is exposed), the solvent accessible 
surface area (SASA), the radius of gyration, and the isoelectric 
point (pi). Global and surface values for the grand average of 
hydropathicity index (GRAVY) (measure of hydropathy) [57], the 
polarity (POL) coefficients [20], and the side chain entropy (SCE) 
[20,58] were obtained by averaging the value for each residue, 
respectively in the protein and on the protein surface (Fig. 1). Note 
that we defined the magnitude of sSCE such that more flexible 
residues have a higher sSCE, which is opposite to the definition of 
Ref [58]. 

The residues were clustered in categories: small (G, A), 
positively charged (H, R, K), negatively charged (D, E), polar 
(C, S, T, N, Q) and hydrophobic (L, I, V, F, Y, Ivl, W, P). To 
incorporate the first many-body correction, we also determined 
the number of neighboring (within 5 A of each other) residue 
categories (small-small, small-polar, and so on) normahzed over 
the total number of neighboring pairs. These variables were used 
both in absolute number and weighted by their solvent accessible 
area, because more exposed residues may play a larger role than 
less exposed ones in protein crystalUzation. Note that these 



predictive variables are not all independent and some have to 
satisfy certain constraints. In particular, the surface fraction 
covered by each amino acid type has to sum up to 1, and, given 
the surface amino acids, sGRAVY, sSCE, and sPOL are uniquely 
determined. 

Combining this information generates two sets of data. The first 
associates a crystallization propensity (fraction of successful 
experiments) to each protein characterized by 89 protein features 
(Fig. 1). The second reports the success or failure of each 
experiment for each protein (254,623 experiments in total) 
characterized by the solution conditions (61 cocktail features) 
and the protein features for a total of 150 predictive variables. The 
data is available upon request. 

Crystal contacts analysis 

Similarly to previous studies [20] , we defined crystal contacts as 
the regions on the proteins surface that are within 5 A from 
surface residues on a neighboring chain in the protein crystal. To 
identify the crystal contacts, we used PyCogent [59], whose 
structural biology tool-kit is an extension of PDBZen [20]. In- 
house Python scripts classified the properties of each crystal 
contact. 

Statistical model 

In standard linear and generalized linear models, the response 
variable y, whether continuous or discrete, is a function cr of a 
linear combination of the predictive variables x 

y = aix'^vf) + e, (5) 

where w indicates the weights of the variables and e is the 

uncertainty of the model. A non-linear dependence among the 
predictive variables cannot be captured by this framework. 
Gaussian processes discard the assumption of linearity and place 
a prior on any possible functional form, giving more flexibility to 
the model. In contrast to deterministic methods (such as Support 
Vector Machine), GP are Bayesian, which means that they assign 
a probability distribution to the response variable and provide a 
confidence interval on the predicted value. In the following, we 
briefly summarize GP regression and classification. More details 
can be found in Rcf [31]. 

In the simplest version of GP inference, the latent function /(x) 
replaces the linear dependency in Eq. (5). The prior on / is 

p{f\x)~NiO,K), (6) 

where N(0,K) indicates a multinormal distribution with zero 
mean and covariance matrix K. Among the several available 
options for K, we opt for the widely used squared exponential, so 
that 

K(Xi,Xj) = exp( - y(xi - x,)P(Xi - x,)^), (7) 

where y and the diagonal matrix P are (hyper-)parameters that 
have to be optimized. In particular, each element of the 
diagonal of P can be related to the typical length scale /,■ of 

variable ; as = . Large correspond to less important 
variables, while a small /,■ identifies a variable whose variation 
strongly affects the response variable. For the scope of this study, 
we arbitrarily defined variables to be significant if /< 100, which is 
roughly the half point between the largest and the smaUest length 
measured in logarithmic scale. 
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GPR. In GPR, the response variable is defined as / = / (x). 
The predictive probability over a test set Xtest, given a training set 

(Xtrain/train), is ^'(/testlXtest.Xtraini/train) ~ A^C/'r-ST), where 
f = m+ ^(Xtest>Xtrain)-K^(Xtrain>Xtrain)(/train ~ w) 



/"(/test |Xtrain>Xtest, Strain)— /'(/test|Xtest/)/'(/^|Xtrain.>'train)'|f ^ 



where the posterior distribution is i'(/|x,>')=/)(y|/)/'(/|x)//)(j'|x). 
Second, the probabilistic prediction is obtained 



K — -^(XtestjXtest) ~ ^(Xtest>Xtram)-K^(Xtram>Xtram)^(Xtram>Xtest)> 

in which m is the mean of the observed response variable over the 
training set. Sampling the predictive distribution provides predic- 
tions on the response variable /. 

The hyper-parameter selection is performed by optimizing the 
marginal log-likelihood 

loglpC/irainlXtrain.y,^)] = - ^-\f^^K-%^-log\K\- nloglnU^) 

where n is the sample size. By determining the gradient of the 
marginal log-likelihood, any conjugate gradient optimization 
method can be used to locate local maxima. The marginal log- 
likelihood was maximized by sampling />, according to a beta 
distribution with a shape parameter of 0.5. For each initial 
condition of the hyper-parameters, we performed a conjugate 
gradient search to identify the corresponding local maximum 
(Fig. 2A). 

We used GPR to study the protein crystallization propensity. 
Because the propensity ranges between 0 and 1 by definition, a 
brute-force regression is not appropriate {f's domain is the whole 
real line). A possible solution to the problem is to link 
crystallization propensity ^ and / using a sigmoidal function. 
The drawback of this approach is that very low propensity values, 
which are by definition affected by large relative uncertainty, 
correspond to large negative values of /. As a result, the inference 
process gives poorer predictions for proteins that are easy to 
crystallize. Because these proteins are of greatest interest to us, we 

opted instead for /= - — -. Although small nonphysical negative 

values of propensity are then allowed, this transformation is close 
to linear for small values of / and emphasizes the contribution of 
high-propensity proteins. In order to have all the predictive 
features on a similar scale (each of them spans very different ranges 
of values), we scaled each feature according to their mean and 
standard deviation (z-scores). Length scales /, then correspond to 
the actual significance levels of each property. 

The search for hot spots, which identifies features that maximize 
protein propensity, was performed by laying down a grid over the 
feature space with a fineness that depended on the length scale of 
the corresponding dimension. For variable with /, <100, four 
equidistant points in the physical range were used, and otherwise 
only the mid value was used. Although not exhaustive, trials with 
finer grids did not detect additional maxima. 

GPC. In GP binary (success/failure) classification, the prob- 
ability of success n is connected to the latent function / by 

7r(x) = 0[/-(x)], (9) 

where (f) is a sigmoidal function, such as logistic or probit. A 
prediction for n can be obtained in two steps. First, the distribution 
of the latent variable / over a test case has to be computed using a 
training set (Xtrain>>'tram)> where Xf^^ indicates the explanatory 
variables values in the set and Strain the corresponding success/ 
failure outcome. 




Unlike for GPR, these integrals cannot be simplified because of 
the non-Gaussian form of (j>. As a result, either analytical 
approximations or numerical methods must be used. In this 
study, the problem is further complicated by the large size of the 
dataset (each sample corresponds to a different experiment), which 
makes any computation involving the GP prior matrix P 
intractable. To bypass this problem, we adopted the sparse 
approximation method implemented in Ref [60] (Informative 
Vector Machine), which relies on incremental Gaussian approx- 
imations of the posterior distribution to provide parameter 
optimization for a probit GP classification. 

To determine the best classification model, we constrained the 
parameters of the protein properties to their GPR values, and 
maximized the marginal log-likelihood over the parameters 
corresponding to the solution conditions. The log-likelihood 
maximum search used a conjugate gradient algorithm starting 
from different initial values sampled according to a beta 
distribution with shape parameter 0.5. 

In the GPC analysis reported in the Results section, we focused 
on how n is affected by the concentration of each individual 
additive. In this case, for a given additive / of concentration c, , we 
determined 7c(c,) = 7i(x,), where the solution feature vector x,- 
corresponds to a condition with neutral pH, additive i concentra- 
tion set to Cj, and all the other additive concentrations set to zero. 
The ionic strength (IS), Hofmeister series parameters (HS,, and 
HSa), and the depletion parameter (DEP) were then determined 
given additive fs properties and concentration as defined in 
Equations (1), (2), (3), and (4). 

Supporting Information 

Materials SI List of the PDB ids of the proteins used in 
this study. 

(PDF) 
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